Goto

Collaborating Authors

 action recognition


An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories

arXiv.org Machine Learning

Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and texts. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors-such as camera orientation, subject scale, viewpoint, and execution speed-rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape - Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, and temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.



tTake out?Ground truth: Put down a cheeseGround truth: Take out a sauce(a) Importance of spatialunderstanding(b) Importance of temporalunderstandingtPut in?ttPut in?Milk carton? Cheese? Ketchup?

Neural Information Processing Systems

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics. The code is available at https://github.com/KHU-VLL/CAST.




Relational Self-Attention: What's Missing in Attention for Video Understanding Supplementary Material

Neural Information Processing Systems

We use TSN-ResNet [11] as our backbone (see Table 1) and initialize it with ImageNet-pretrained weights [4]. We replace its 7 spatial convolutional layers with the RSA layers; for every two ResNet blocks from the third block in res2 to the second block in res5, each spatial convolutional layer is replaced with the RSA layer. For the bottlenecks including RSA layers, we randomly initialize weights using MSRA initialization [3] and set the gamma parameter of the last batch normalization layer to zero. We resize the resolution of each frame to 240 320, and apply random cropping as 224 224, scale jittering, and random horizontal flipping for data augmentation. Note that we do not flip videos of which action labels include'left' or'right' words, e.g., 'pulling something from left to right'.